library(tidyverse)
library(caret)
library(nnet)
library(plotly)

Statistical modeling and prediction

1. Effect of sea water temperature on fish in the Great Barrier Reef

1.1 Data exploratory analysis and visualization

In this analysis, we will examine the effects of water temperature on the number of unique fish species observed in the Great Barrier Reef from 1997 to 2011.

Originally, we wanted to examine the relationship between water temperature and coral cover. However, the visualizations below show no apparent linear relationship, and we did not want to fit a misspecified model.
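As a numerical complement to the visual check, the sketch below illustrates why a scatter plot alone can be misleading. It uses synthetic stand-in data (the variable names are hypothetical, not the actual merged columns): Pearson correlation only measures linear association, and curvature in a residuals-vs-fitted plot flags a misspecified linear model.

```r
# Sketch of a linearity check on synthetic stand-in data (not the real
# temperature / coral-cover columns).
set.seed(1)
x <- runif(100, 24, 30)              # hypothetical water temperatures (deg C)
y <- sin(x) + rnorm(100, sd = 0.1)   # a deliberately non-linear response
cor(x, y)                            # Pearson captures *linear* association only
fit <- lm(y ~ x)
plot(fit, which = 1)                 # curvature in residuals => misspecification
```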

Merging the temperature and coral cover data sets

temp <- read_csv("cleaned_data/aims_temperatures.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   avg_water_temp_2.0m_flat_site = col_double(),
##   avg_water_temp_9.0m_slope_site = col_double()
## )
coral_cover <- read_csv("cleaned_data/coral_cover.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   mean_live_coral_cover_percent = col_double(),
##   lower_conf_int = col_double(),
##   upper_conf_int = col_double(),
##   conf_int_span = col_double()
## )
temp_coral_cover <- merge(x=temp, y=coral_cover, by="date")
temp_coral_cover %>%
    ggplot(aes(avg_water_temp_2.0m_flat_site, mean_live_coral_cover_percent)) + 
    geom_point()

temp_coral_cover %>%
    ggplot(aes(avg_water_temp_9.0m_slope_site, mean_live_coral_cover_percent)) + 
    geom_point()

So, we will instead examine the relationship between water temperature and the number of unique fish species observed in the Great Barrier Reef. First, we can do some basic visualization by plotting water temperature at 2m and 9m against the number of unique fish species observed.

fish <- read_csv("cleaned_data/fish_species_counts.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   num_of_species = col_double()
## )
fish_temp <- merge(x=temp, y=fish, by="date")

# water temp at 2.0m
fish_temp %>% ggplot(aes(avg_water_temp_2.0m_flat_site, num_of_species)) + geom_point()

# water temp at 9.0m
fish_temp %>% ggplot(aes(avg_water_temp_9.0m_slope_site, num_of_species)) + geom_point()

There appears to be a slight negative linear relationship, so we will fit a linear model relating the number of unique fish species observed in the Great Barrier Reef to water temperature.

1.2 Linear regression

Now, we can split our data set into train and test sets, using a 60%/40% partition. Our outcome is the number of unique fish species observed.

train_index <- createDataPartition(y=fish_temp$num_of_species, times=1, p = 0.6, list=FALSE)

train_set <- fish_temp[train_index, ]
test_set <- fish_temp[-train_index, ]

First, we will fit two models using each temperature at different depths as a single covariate, and then we will use both predictors to create a multiple linear regression model using both water temperatures as our covariates.

fish_temp_2.0m <- lm(num_of_species ~ avg_water_temp_2.0m_flat_site, data=train_set)
summary(fish_temp_2.0m)
## 
## Call:
## lm(formula = num_of_species ~ avg_water_temp_2.0m_flat_site, 
##     data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.346  -9.001   1.969   9.969  39.187 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   135.7093    17.4053   7.797 3.25e-13 ***
## avg_water_temp_2.0m_flat_site  -2.6880     0.6351  -4.232 3.50e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.25 on 203 degrees of freedom
## Multiple R-squared:  0.08109,    Adjusted R-squared:  0.07656 
## F-statistic: 17.91 on 1 and 203 DF,  p-value: 3.5e-05
fish_temp_9.0m <- lm(num_of_species ~ avg_water_temp_9.0m_slope_site, data=train_set)
summary(fish_temp_9.0m)
## 
## Call:
## lm(formula = num_of_species ~ avg_water_temp_9.0m_slope_site, 
##     data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.294  -9.189   2.220   9.840  39.588 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    136.8860    17.6544   7.754 4.23e-13 ***
## avg_water_temp_9.0m_slope_site  -2.7402     0.6464  -4.239 3.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.24 on 203 degrees of freedom
## Multiple R-squared:  0.08133,    Adjusted R-squared:  0.0768 
## F-statistic: 17.97 on 1 and 203 DF,  p-value: 3.404e-05

Now we fit a multiple regression model using both water temperatures as covariates.

# use a new name so the merged fish_temp data frame is not overwritten
fish_temp_both <- lm(num_of_species ~ avg_water_temp_2.0m_flat_site + avg_water_temp_9.0m_slope_site, data=train_set)
summary(fish_temp_both)
## 
## Call:
## lm(formula = num_of_species ~ avg_water_temp_2.0m_flat_site + 
##     avg_water_temp_9.0m_slope_site, data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.306  -9.151   2.192   9.864  39.471 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    136.6510    17.8368   7.661 7.53e-13 ***
## avg_water_temp_2.0m_flat_site   -0.7927     7.5044  -0.106    0.916    
## avg_water_temp_9.0m_slope_site  -1.9362     7.6388  -0.253    0.800    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.28 on 202 degrees of freedom
## Multiple R-squared:  0.08138,    Adjusted R-squared:  0.07228 
## F-statistic: 8.948 on 2 and 202 DF,  p-value: 0.0001891

Interestingly, when both water temperatures are included, neither covariate is a statistically significant predictor, as seen in the p-values of 0.916 and 0.800; this is what we would expect if the two temperature series are highly collinear. So, we will compare the two simple linear models to assess which water temperature depth better predicts the number of unique fish species observed in the Great Barrier Reef. Let’s make predictions on our test data and compare performance between the 2.0m and 9.0m temperature models.
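Before that, the collinearity suspicion can be checked directly. The sketch below uses synthetic stand-in data (variable names hypothetical) to show the two standard diagnostics: the pairwise correlation between the predictors, and the variance inflation factor (VIF).

```r
# Sketch: diagnosing multicollinearity with pairwise correlation and VIF,
# on a synthetic stand-in for the two temperature series.
set.seed(42)
t2m <- rnorm(200, mean = 27, sd = 1.5)
t9m <- t2m + rnorm(200, sd = 0.2)           # 9m temp tracks 2m temp closely
y   <- 130 - 2.7 * t2m + rnorm(200, sd = 15)
cor(t2m, t9m)                               # near 1 => collinear predictors
fit <- lm(y ~ t2m + t9m)
# VIF for predictor j = 1 / (1 - R^2 from regressing x_j on the others)
vif_t2m <- 1 / (1 - summary(lm(t2m ~ t9m))$r.squared)
vif_t2m                                     # values >> 10 signal severe collinearity
```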

pred_2.0m <- predict(fish_temp_2.0m, test_set)
pred_9.0m <- predict(fish_temp_9.0m, test_set)

postResample(pred = pred_2.0m, obs = test_set$num_of_species)
##        RMSE    Rsquared         MAE 
## 15.05511231  0.06498576 11.58183463
postResample(pred = pred_9.0m, obs = test_set$num_of_species)
##       RMSE   Rsquared        MAE 
## 14.9955378  0.0717282 11.5291630

We can assess this visually to confirm our results.

# water temp at 2.0m
test_set %>% 
    ggplot(aes(avg_water_temp_2.0m_flat_site, num_of_species)) + 
    geom_point() +
    geom_abline(intercept=fish_temp_2.0m$coefficients[1], slope=fish_temp_2.0m$coefficients[2], col="red")

# water temp at 9.0m
test_set %>% 
    ggplot(aes(avg_water_temp_9.0m_slope_site, num_of_species)) + 
    geom_point() +
    geom_abline(intercept=fish_temp_9.0m$coefficients[1], slope=fish_temp_9.0m$coefficients[2], col="blue")

Both models perform very similarly, so choosing either water temperature as the predictor yields comparable results.
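A single train/test split can be noisy; k-fold cross-validation would give a more stable comparison between the two predictors. A base-R sketch on synthetic stand-in data (column names echo the real ones, but the data are simulated):

```r
# Sketch: compare the two single-predictor models by 5-fold CV RMSE
# on synthetic stand-in data.
set.seed(11)
n  <- 300
df <- data.frame(t2m = rnorm(n, 27, 1.5))
df$t9m <- df$t2m + rnorm(n, sd = 0.3)
df$num_of_species <- 135 - 2.7 * df$t2m + rnorm(n, sd = 15)
folds <- sample(rep(1:5, length.out = n))   # random fold assignment
cv_rmse <- function(form) {
  mse <- sapply(1:5, function(k) {
    fit <- lm(form, data = df[folds != k, ])
    mean((df$num_of_species[folds == k] - predict(fit, df[folds == k, ]))^2)
  })
  sqrt(mean(mse))
}
c(cv_rmse(num_of_species ~ t2m), cv_rmse(num_of_species ~ t9m))
```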



2. Multiclass classification of seagrass species observed in the Great Barrier Reef from 1999 to 2003.

2.1 Data exploratory analysis and visualization

seagrass <- read.csv("cleaned_data/seagrass_classification_data.csv", as.is =TRUE)

seagrass$SPECIES <- as.factor(seagrass$SPECIES)
seagrass$SEDIMENT <- as.factor(seagrass$SEDIMENT)
seagrass$TIDAL <- as.factor(seagrass$TIDAL)

head(seagrass)
summary(seagrass)
##        SPECIES         LATITUDE        LONGITUDE         DEPTH        
##  C_SERRULAT: 1256   Min.   :-24.20   Min.   :142.5   Min.   : 0.0000  
##  S_ISOETIFO:  101   1st Qu.:-23.79   1st Qu.:146.8   1st Qu.: 0.0000  
##  T_HEMPRICH:  823   Median :-22.41   Median :150.5   Median : 0.0000  
##  Z_CAPIRCOR:10201   Mean   :-21.06   Mean   :149.0   Mean   : 0.9253  
##                     3rd Qu.:-19.17   3rd Qu.:151.3   3rd Qu.: 1.2279  
##                     Max.   :-10.75   Max.   :151.9   Max.   :29.3583  
##                                                                       
##    SEDIMENT           TIDAL     
##  Gravel:   1   intertidal:7433  
##  Mud   :8969   subtidal  :4948  
##  Reef  :  63                    
##  Rock  :  66                    
##  Rubble: 107                    
##  Sand  :3151                    
##  Shell :  24

Data visualization

seagrass %>%  ggplot() + geom_point(aes(x=LATITUDE, y=LONGITUDE, color=SPECIES))

seagrass %>% 
    ggplot(aes(SEDIMENT, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")

plot_ly(seagrass, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")



2.2 Random forest

First, we partition our data set into train and test sets. Since we have much more data here than in the linear regression setting, we will use a 75%/25% split.

# test-train split
seagrass_train_ind <- createDataPartition(y = seagrass$SPECIES, p=0.75, list=FALSE)

train_set <- seagrass[seagrass_train_ind, ]
test_set <- seagrass[-seagrass_train_ind, ]
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_fit <- randomForest(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL, 
                       data=train_set,
                       mtry = 2)
rf_fit
## 
## Call:
##  randomForest(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH +      SEDIMENT + TIDAL, data = train_set, mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.6%
## Confusion matrix:
##            C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR class.error
## C_SERRULAT        716         20         10        196  0.23991507
## S_ISOETIFO         34         29          5          8  0.61842105
## T_HEMPRICH         49          0        556         13  0.10032362
## Z_CAPIRCOR         85          5          2       7559  0.01202457
rf_pred <- predict(rf_fit, newdata = test_set, type = "response")
confusionMatrix(table(pred = rf_pred, true = test_set$SPECIES))
## Confusion Matrix and Statistics
## 
##             true
## pred         C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
##   C_SERRULAT        232          5         16         26
##   S_ISOETIFO          6         11          0          1
##   T_HEMPRICH          7          3        186          3
##   Z_CAPIRCOR         69          6          3       2520
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9531          
##                  95% CI : (0.9451, 0.9603)
##     No Information Rate : 0.8242          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8409          
##                                           
##  Mcnemar's Test P-Value : 4.587e-05       
## 
## Statistics by Class:
## 
##                      Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity                    0.73885          0.440000           0.90732
## Specificity                    0.98309          0.997719           0.99550
## Pos Pred Value                 0.83154          0.611111           0.93467
## Neg Pred Value                 0.97087          0.995449           0.99344
## Prevalence                     0.10149          0.008080           0.06626
## Detection Rate                 0.07498          0.003555           0.06012
## Detection Prevalence           0.09017          0.005818           0.06432
## Balanced Accuracy              0.86097          0.718860           0.95141
##                      Class: Z_CAPIRCOR
## Sensitivity                     0.9882
## Specificity                     0.8566
## Pos Pred Value                  0.9700
## Neg Pred Value                  0.9395
## Prevalence                      0.8242
## Detection Rate                  0.8145
## Detection Prevalence            0.8397
## Balanced Accuracy               0.9224

We see that our classification model works quite well, especially for T_HEMPRICH and Z_CAPIRCOR. However, we got quite a low sensitivity for S_ISOETIFO. Recall from our data wrangling section that S_ISOETIFO had only 101 observations. Since the sample size for S_ISOETIFO is small relative to the other three seagrass species, this class imbalance likely contributed to the low sensitivity.
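One mitigation worth trying is rebalancing the training set before refitting, e.g. upsampling the minority class (a base-R sketch on toy data with hypothetical counts; randomForest's classwt and sampsize arguments serve a similar purpose):

```r
# Sketch: upsample the minority class so each class is equally represented.
set.seed(7)
train_toy <- data.frame(
  SPECIES = factor(c(rep("Z_CAPIRCOR", 950), rep("S_ISOETIFO", 50))),
  DEPTH   = rnorm(1000)
)
counts   <- table(train_toy$SPECIES)
minority <- names(counts)[which.min(counts)]
# resample minority rows with replacement until classes are balanced
extra    <- train_toy[sample(which(train_toy$SPECIES == minority),
                             max(counts) - min(counts), replace = TRUE), ]
balanced <- rbind(train_toy, extra)
table(balanced$SPECIES)                  # both classes now equally represented
```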

Visualization

# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~rf_pred, type="scatter3d", mode="markers")
variable_importance <- importance(rf_fit)
# data_frame() is deprecated; tibble() is the current equivalent
tmp <- tibble(feature = rownames(variable_importance),
              Gini = variable_importance[,1]) %>%
    arrange(desc(Gini))
tmp

We see that longitude and latitude were highly predictive of seagrass species, followed by depth below sea level. Sediment type and tidal zone (intertidal or subtidal) are much weaker predictors. Thus, where the seagrass was observed matters more than the local seabed properties.

tmp %>% ggplot(aes(x=reorder(feature, Gini), y=Gini)) + 
  geom_bar(stat='identity') +
  coord_flip() + xlab("Feature") +
  theme(axis.text=element_text(size=8))



2.3 Multinomial logistic regression

We can try a multinomial logistic regression model to see how it compares to our random forest model.

The multinomial logistic regression model is fit as follows:

library(nnet)

multinom_fit <- multinom(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, data=train_set)
## # weights:  44 (30 variable)
## initial  value 12874.515732 
## iter  10 value 3289.126720
## iter  20 value 2785.642593
## iter  30 value 2553.108055
## iter  40 value 2493.529943
## iter  50 value 2473.925457
## iter  60 value 2441.472833
## iter  70 value 2437.264881
## iter  80 value 2437.190386
## iter  90 value 2435.636816
## iter 100 value 2434.782657
## final  value 2434.782657 
## stopped after 100 iterations
summary(multinom_fit)
## Warning in sqrt(diag(vc)): NaNs produced
## Call:
## multinom(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, 
##     data = train_set)
## 
## Coefficients:
##            (Intercept)  LATITUDE LONGITUDE       DEPTH SEDIMENTMud SEDIMENTReef
## S_ISOETIFO   -377.5469 2.1258044  2.715572  0.01080721   14.837481    15.705391
## T_HEMPRICH   -151.4488 1.8022690  1.214187 -0.36056832    3.572568     6.256921
## Z_CAPIRCOR   -320.0220 0.6707213  1.481874 -0.73169016  117.594272    14.403740
##            SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO    15.586743     -44.668561    15.361435    -32.600781
## T_HEMPRICH     7.266048       7.602532     6.577231      6.079333
## Z_CAPIRCOR   114.450738     113.309648   116.642682    116.268956
## 
## Std. Errors:
##             (Intercept)   LATITUDE   LONGITUDE      DEPTH SEDIMENTMud
## S_ISOETIFO 0.0009111163 0.05681901 0.007533912 0.02450639   0.3211980
## T_HEMPRICH 0.0012850217 0.05534792 0.007082390 0.04263606   0.3094183
## Z_CAPIRCOR 0.0027829477 0.02881488 0.004610251 0.02773103   0.3498394
##            SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 5.537842e-01    0.6471808            NaN    0.3020870  2.286341e-15
## T_HEMPRICH 5.409474e-01    0.5819349      0.4805512    0.2842374  1.061233e+00
## Z_CAPIRCOR 2.480399e-44    0.6777043      1.0186151    0.3516580  8.069902e-01
## 
## Residual Deviance: 4869.565 
## AIC: 4929.565

Relative risk ratios, with C_SERRULAT as the reference group:

exp(coef(multinom_fit))
##              (Intercept) LATITUDE LONGITUDE     DEPTH  SEDIMENTMud SEDIMENTReef
## S_ISOETIFO 1.080137e-164 8.379635 15.113256 1.0108658 2.778664e+06 6618579.0049
## T_HEMPRICH  1.685077e-66 6.063390  3.367554 0.6972799 3.560792e+01     521.6106
## Z_CAPIRCOR 1.038018e-139 1.955647  4.401186 0.4810952 1.176368e+51 1800797.8701
##            SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 5.878094e+06   3.987407e-20 4.692307e+06  6.944816e-15
## T_HEMPRICH 1.430884e+03   2.003261e+03 7.185473e+02  4.367378e+02
## Z_CAPIRCOR 5.073689e+49   1.620894e+49 4.542273e+50  3.125835e+50
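To make the exponentiated coefficients concrete, here is a toy multinomial fit and the reading of one ratio (synthetic data; class and variable names are hypothetical):

```r
# Sketch: interpreting relative risk ratios from a toy multinomial fit.
library(nnet)
set.seed(3)
d <- data.frame(
  y = factor(sample(c("A", "B", "C"), 300, replace = TRUE)),
  x = rnorm(300)
)
fit <- multinom(y ~ x, data = d, trace = FALSE)
rrr <- exp(coef(fit))    # rows: non-reference classes B and C; reference is A
rrr
# rrr["B", "x"] is the multiplicative change in the odds of class B
# relative to class A for a one-unit increase in x.
```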
# predicted probabilities
predicted_prob <- predict(multinom_fit, newdata=test_set, type="probs")

# predicted classes
predicted_class <- predict(multinom_fit, newdata=test_set, type="class")

confusionMatrix(data = predicted_class, reference = test_set$SPECIES )
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
##   C_SERRULAT        134          7         11         24
##   S_ISOETIFO          0          1          1          0
##   T_HEMPRICH         11          1        184         26
##   Z_CAPIRCOR        169         16          9       2500
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9111          
##                  95% CI : (0.9005, 0.9209)
##     No Information Rate : 0.8242          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.673           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity                    0.42675         0.0400000           0.89756
## Specificity                    0.98489         0.9996742           0.98685
## Pos Pred Value                 0.76136         0.5000000           0.82883
## Neg Pred Value                 0.93831         0.9922380           0.99269
## Prevalence                     0.10149         0.0080802           0.06626
## Detection Rate                 0.04331         0.0003232           0.05947
## Detection Prevalence           0.05688         0.0006464           0.07175
## Balanced Accuracy              0.70582         0.5198371           0.94220
##                      Class: Z_CAPIRCOR
## Sensitivity                     0.9804
## Specificity                     0.6434
## Pos Pred Value                  0.9280
## Neg Pred Value                  0.8750
## Prevalence                      0.8242
## Detection Rate                  0.8080
## Detection Prevalence            0.8707
## Balanced Accuracy               0.8119

We see that our multinomial logistic model has about 91% overall accuracy, somewhat worse than the random forest's 95%. However, the model does a very poor job of predicting S_ISOETIFO.

Among the minority classes, the model predicts T_HEMPRICH best, with 89.8% sensitivity and 98.7% specificity. The model also has low sensitivity for C_SERRULAT, at only about 42.7%.

density_plot_long_true <- ggplot(test_set, aes(x=LONGITUDE, fill=SPECIES)) + 
  geom_density(alpha=0.7)

density_plot_long_true

density_plot_long_pred <- ggplot(test_set, aes(x=LONGITUDE, fill=predicted_class)) + 
  geom_density(alpha=0.7)

density_plot_long_pred

density_plot_lat_true <- ggplot(test_set, aes(x=LATITUDE, fill=SPECIES)) + 
  geom_density(alpha=0.7)

density_plot_lat_true

density_plot_lat_pred <- ggplot(test_set, aes(x=LATITUDE, fill=predicted_class)) + 
  geom_density(alpha=0.7)

density_plot_lat_pred

density_plot_depth_true <- ggplot(test_set, aes(x=DEPTH, fill=SPECIES)) + 
  geom_density(alpha=0.7)

density_plot_depth_true

density_plot_depth_pred <- ggplot(test_set, aes(x=DEPTH, fill=predicted_class)) + 
  geom_density(alpha=0.7)

density_plot_depth_pred

So, comparing overall accuracy on the held-out test set, the multinomial logistic regression model achieved about 91.1%, versus 95.3% for the random forest model.
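As a sanity check, both accuracies can be recovered directly from the confusion-matrix diagonals reported above:

```r
# Accuracy = sum of the diagonal (correct predictions) / total test-set size,
# using the counts from the two confusion matrices above.
rf_correct <- 232 + 11 + 186 + 2520    # random forest diagonal
mn_correct <- 134 + 1 + 184 + 2500     # multinomial diagonal
n_test     <- 3094                     # total count in either table
round(c(rf = rf_correct / n_test, multinom = mn_correct / n_test), 4)
# rf 0.9531, multinom 0.9111 -- matching the reported accuracies
```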